328 research outputs found

    Social Ranking Techniques for the Web

    Full text link
    The proliferation of social media has the potential for changing the structure and organization of the web. In the past, scientists have looked at the web as a large connected component to understand how the topology of hyperlinks correlates with the quality of information contained in the page and they proposed techniques to rank information contained in web pages. We argue that information from web pages and network data on social relationships can be combined to create a personalized and socially connected web. In this paper, we look at the web as a composition of two networks, one consisting of information in web pages and the other of personal data shared on social media web sites. Together, they allow us to analyze how social media tunnels the flow of information from person to person and how to use the structure of the social network to rank, deliver, and organize information specifically for each individual user. We validate our social ranking concepts through a ranking experiment conducted on web pages that users shared on Google Buzz and Twitter.Comment: 7 pages, ASONAM 201

    Optimizing XML Compression

    Full text link
    The eXtensible Markup Language (XML) provides a powerful and flexible means of encoding and exchanging data. As it turns out, its main advantage as an encoding format (namely, its requirement that all open and close markup tags are present and properly balanced) yield also one of its main disadvantages: verbosity. XML-conscious compression techniques seek to overcome this drawback. Many of these techniques first separate XML structure from the document content, and then compress each independently. Further compression gains can be realized by identifying and compressing together document content that is highly similar, thereby amortizing the storage costs of auxiliary information required by the chosen compression algorithm. Additionally, the proper choice of compression algorithm is an important factor not only for the achievable compression gain, but also for access performance. Hence, choosing a compression configuration that optimizes compression gain requires one to determine (1) a partitioning strategy for document content, and (2) the best available compression algorithm to apply to each set within this partition. In this paper, we show that finding an optimal compression configuration with respect to compression gain is an NP-hard optimization problem. This problem remains intractable even if one considers a single compression algorithm for all content. We also describe an approximation algorithm for selecting a partitioning strategy for document content based on the branch-and-bound paradigm.Comment: 16 pages, extended version of paper accepted for XSym 200

    Factorization in Formal Languages

    Get PDF
    We consider several novel aspects of unique factorization in formal languages. We reprove the familiar fact that the set uf(L) of words having unique factorization into elements of L is regular if L is regular, and from this deduce an quadratic upper and lower bound on the length of the shortest word not in uf(L). We observe that uf(L) need not be context-free if L is context-free. Next, we consider variations on unique factorization. We define a notion of "semi-unique" factorization, where every factorization has the same number of terms, and show that, if L is regular or even finite, the set of words having such a factorization need not be context-free. Finally, we consider additional variations, such as unique factorization "up to permutation" and "up to subset"

    Patterns of Individual Shopping Behavior

    Get PDF
    Much of economic theory is built on observations of aggregate, rather than individual, behavior. Here, we present novel findings on human shopping patterns at the resolution of a single purchase. Our results suggest that much of our seemingly elective activity is actually driven by simple routines. While the interleaving of shopping events creates randomness at the small scale, on the whole consumer behavior is largely predictable. We also examine income-dependent differences in how people shop, and find that wealthy individuals are more likely to bundle shopping trips. These results validate previous work on mobility from cell phone data, while describing the unpredictability of behavior at higher resolution.Comment: 4 pages, 5 figure

    Composite repetition-aware data structures

    Get PDF
    In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT enables also a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal.Comment: (the name of the third co-author was inadvertently omitted from previous version

    A New View on Worst-Case to Average-Case Reductions for NP Problems

    Full text link
    We study the result by Bogdanov and Trevisan (FOCS, 2003), who show that under reasonable assumptions, there is no non-adaptive worst-case to average-case reduction that bases the average-case hardness of an NP-problem on the worst-case complexity of an NP-complete problem. We replace the hiding and the heavy samples protocol in [BT03] by employing the histogram verification protocol of Haitner, Mahmoody and Xiao (CCC, 2010), which proves to be very useful in this context. Once the histogram is verified, our hiding protocol is directly public-coin, whereas the intuition behind the original protocol inherently relies on private coins

    Dictionary-based methods for information extraction

    Get PDF
    In this paper, we present a general method for information extraction that exploits the features of data compression techniques. We first define and focus our attention on the so-called dictionary of a sequence. Dictionaries are intrinsically interesting and a study of their features can be of great usefulness to investigate the properties of the sequences they have been extracted from e.g. DNA strings. We then describe a procedure of string comparison between dictionary-created sequences (or artificial texts) that gives very good results in several contexts. We finally present some results on self-consistent classification problems

    Features of the Extension of a Statistical Measure of Complexity to Continuous Systems

    Full text link
    We discuss some aspects of the extension to continuous systems of a statistical measure of complexity introduced by Lopez-Ruiz, Mancini and Calbet (LMC) [Phys. Lett. A 209 (1995) 321]. In general, the extension of a magnitude from the discrete to the continuous case is not a trivial process and requires some choice. In the present study, several possibilities appear available. One of them is examined in detail. Some interesting properties desirable for any magnitude of complexity are discovered on this particular extension.Comment: 22 pages, 0 figure
    corecore